177 research outputs found

    Full-fledged Real-Time Indexing for Constant Size Alphabets

    In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text over an alphabet of constant size. Each new symbol can be prepended to $T$ in $O(1)$ worst-case time. At any moment, we can report all occurrences of a pattern $P$ in the current text in $O(|P|+k)$ time, where $|P|$ is the length of $P$ and $k$ is the number of occurrences. This resolves, under the assumption of a constant-size alphabet, a long-standing open problem of the existence of a real-time indexing method for string matching (see \cite{AmirN08}).

    Reconsidering the significance of genomic word frequency

    We propose that the distribution of DNA words in genomic sequences can be primarily characterized by a double Pareto-lognormal distribution, which explains lognormal and power-law features found across all known genomes. Such a distribution may be the result of completely random sequence evolution by duplication processes. The parametrization of genomic word frequencies allows for an assessment of significance for frequent or rare sequence motifs.

    Estimating seed sensitivity on homogeneous alignments

    We address the problem of estimating the sensitivity of seed-based similarity search algorithms. In contrast to approaches based on Markov models [18, 6, 3, 4, 10], we study the estimation based on homogeneous alignments. We describe an algorithm for counting and random generation of those alignments and an algorithm for exact computation of the sensitivity for a broad class of seed strategies. We provide experimental results demonstrating a bias introduced by ignoring the homogeneousness condition.

    RNF: a general framework to evaluate NGS read mappers

    Aligning reads to a reference sequence is a fundamental step in numerous bioinformatics pipelines. As a consequence, the sensitivity and precision of the mapping tool, applied with certain parameters to certain data, can critically affect the accuracy of produced results (e.g., in variant calling applications). Therefore, there has been an increasing demand for methods for comparing mappers and for measuring the effects of their parameters. Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers. Simulation of reads is accompanied by information about their positions in the source genome. This information is then used to evaluate alignments produced by the mapper. Finally, reports containing statistics of successful read alignments are created. In the absence of standards for encoding read origins, every evaluation tool has to be made explicitly compatible with the simulator used to generate reads. To overcome this obstacle, we have created a generic format RNF (Read Naming Format) for assigning read names with encoded information about original positions. Furthermore, we have developed an associated software package RNF containing two principal components. MIShmash applies one of several popular read simulation tools (among DwgSim, Art, Mason, CuReSim etc.) and transforms the generated reads into RNF format. LAVEnder then evaluates a given read mapper using simulated reads in RNF format. Special attention is paid to mapping qualities, which serve for parametrization of ROC curves, and to evaluation of the effect of read sample contamination.

    Linear pattern matching on sparse suffix trees

    Packing several characters into one computer word is a simple and natural way to compress the representation of a string and to speed up its processing. Exploiting this idea, we propose an index for a packed string, based on a {\em sparse suffix tree} \cite{KU-96} with appropriately defined suffix links. Assuming, under the standard unit-cost RAM model, that a word can store up to $\log_{\sigma} n$ characters ($\sigma$ is the alphabet size), our index takes $O(n/\log_{\sigma} n)$ space, i.e. the same space as the packed string itself. The resulting pattern matching algorithm runs in time $O(m+r^2+r\cdot occ)$, where $m$ is the length of the pattern, $r$ is the actual number of characters stored in a word and $occ$ is the number of pattern occurrences.
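
As a rough illustration of the packing idea only (not the sparse suffix tree index itself), the sketch below packs a string over a small alphabet into fixed-size words. The function names `pack` and `unpack`, the 64-bit word size, and the DNA alphabet in the usage note are assumptions made for this example; they do not come from the paper.

```python
def pack(text, alphabet, word_bits=64):
    """Pack text into integers, each holding up to r characters.

    Hypothetical helper: word_bits=64 is an assumed machine word size.
    """
    code = {c: i for i, c in enumerate(alphabet)}
    # Bits needed per character: ceil(log2(sigma)), at least 1.
    bits = max(1, (len(alphabet) - 1).bit_length())
    r = word_bits // bits  # number of characters that fit in one word
    words = []
    for start in range(0, len(text), r):
        w = 0
        for ch in text[start:start + r]:
            w = (w << bits) | code[ch]
        words.append(w)
    return words, r, bits


def unpack(words, n, alphabet, r, bits):
    """Recover the original text of length n from its packed words."""
    mask = (1 << bits) - 1
    out = []
    for i, w in enumerate(words):
        count = min(r, n - i * r)  # the last word may be partially filled
        for j in range(count - 1, -1, -1):
            out.append(alphabet[(w >> (j * bits)) & mask])
    return "".join(out)
```

For example, with the alphabet `"ACGT"` each character takes 2 bits, so a 64-bit word holds 32 characters and a text of 83 characters occupies 3 words.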

    On the combinatorics of suffix arrays

    We prove several combinatorial properties of suffix arrays, including a characterization of suffix arrays through a bijection with a certain well-defined class of permutations. Our approach is based on the characterization of Burrows-Wheeler arrays given in [1], which we apply by reducing suffix sorting to cyclic shift sorting through the use of an additional sentinel symbol. We show that the characterization of suffix arrays for a special case of binary alphabet given in [2] easily follows from our characterization. Based on our results, we also provide simple proofs for the enumeration results for suffix arrays, obtained in [3]. Our approach to characterizing suffix arrays is the first that exploits their relationship with Burrows-Wheeler permutations.
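
The reduction from suffix sorting to cyclic shift sorting can be illustrated with a naive sketch. It assumes a sentinel `$` that does not occur in the text and is smaller than every text symbol; the function names are hypothetical, and the simple comparison-based sorting is for illustration only, not the construction or proof technique used in the paper.

```python
def suffix_array(s):
    """Naive suffix array: start positions of suffixes in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def cyclic_shift_order(t):
    """Start positions of the cyclic shifts of t in sorted order."""
    return sorted(range(len(t)), key=lambda i: t[i:] + t[:i])


def suffix_array_via_shifts(s, sentinel="$"):
    """Suffix array of s obtained by sorting cyclic shifts of s + sentinel.

    Assumption: the sentinel is absent from s and smaller than all its symbols.
    """
    assert sentinel not in s and all(sentinel < c for c in s)
    order = cyclic_shift_order(s + sentinel)
    # The shift starting at the sentinel sorts first; drop it.
    return [i for i in order if i != len(s)]


def bwt(t):
    """Burrows-Wheeler transform: last column of the sorted cyclic shifts."""
    return "".join(t[i - 1] for i in cyclic_shift_order(t))
```

Because the sentinel is unique and minimal, two cyclic shifts of `s + "$"` always differ no later than the first sentinel position, so their order coincides with the order of the corresponding suffixes of `s`.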